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Abstract 

Background: In contrast to international pig breeds, the Iberian breed has not been admixed with Asian 
germplasm. This makes it an important model to study both domestication and relevance of Asian genes in the 
pig. Besides, Iberian pigs exhibit high meat quality as well as appetite and propensity to obesity. Here we provide a 
genome wide analysis of nucleotide and structural diversity in a reduced representation library from a pool (n=9 
sows) and shotgun genomic sequence from a single sow of the highly inbred Guadyerbas strain. In the pool, we 
applied newly developed tools to account for the peculiarities of these data. 

Results: A total of 254,106 SNPs in the pool (79.6 Mb covered) and 643,783 in the Guadyerbas sow (1.47 Gb 
covered) were called. The nucleotide diversity (1.31x10~ 3 per bp in autosomes) is very similar to that reported in 
wild boar. A much lower than expected diversity in the X chromosome was confirmed (1 .79x1 0" 4 per bp in the 
individual and 5.83x1 0" 4 per bp in the pool). A strong (0.70) correlation between recombination and variability was 
observed, but not with gene density or GC content. Multicopy regions affected about 4% of annotated pig genes 
in their entirety, and 2% of the genes partially. Genes within the lowest variability windows comprised interferon 
genes and, in chromosome X, genes involved in behavior like HTR2C or MCEP2. A modified Hudson-Kreitman 
-Aguade test for pools also indicated an accelerated evolution in genes involved in behavior, as well as in 
spermatogenesis and in lipid metabolism. 

Conclusions: This work illustrates the strength of current sequencing technologies to picture a comprehensive 
landscape of variability in livestock species, and to pinpoint regions containing genes potentially under selection. 
Among those genes, we report genes involved in behavior, including feeding behavior, and lipid metabolism. The 
pig X chromosome is an outlier in terms of nucleotide diversity, which suggests selective constraints. Our data 
further confirm the importance of structural variation in the species, including Iberian pigs, and allowed us to 
identify new paralogs for known gene families. 
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Background 

The pig is one of the most important sources of meat 
worldwide, as well as a relevant biomedical model for 
some diseases like metabolic syndrome or obesity [1,2]. 
Current high throughput sequencing technologies, to- 
gether with the recent completion of porcine s genome 
and its annotation [3], makes it possible to study the 
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genomic variability of specific breeds with a detail that 
was not possible until now. Here, we present a thorough 
genomewide analysis of the Iberian breed. Commercial 
pig breeds that are today exploited internationally, e.g., 
Landrace, Large White or Duroc, are the result of 
introgressing local primigenious European breeds with 
Asian germplasm, a process that is now well docu- 
mented [4,5]. In contrast, European wild boars, as well 
as local Mediterranean breeds like the Iberian breed, 
were not affected by this admixture process. Given the 
high divergence between Asian and primigenious 
European pigs, ca. 1 MYA [3], and the extent and inten- 
sity of modern selection methods, the study of Iberian 
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pigs can illuminate both the domestication process and 
the influence of Asian germplasm in the shaping of 
current international pig breeds. Besides, Iberian pigs 
are important economically because of their high meat 
quality and resilience to endure harsh environmental 
conditions [6]. They are very fat pigs, markedly different 
from modern lean pigs, and are interesting from a hu- 
man biomedical perspective because they present high 
feed intake and propension to obesity, compatible with 
high values of serum leptin [7]. 

Here, we describe a genomic analysis of the Iberian 
breed using a mixed approach: a reduced representation 
library (RRL, [8]) sequencing of a pool of nine sows, and 
a shotgun complete genome sequencing of a highly in- 
bred Iberian strain (Guadyerbas). The latter strain has 
been used in numerous QTL experiments and has been 
maintained in isolation for over 68 years and 25 genera- 
tions in a closed herd, El Deheson del Encinar, located in 
Toledo, central Spain [9]. In a previous work [10], we 
reported a partial RRL sequencing of the same sow, 1% 
of the genome approximately. The pool is made up of 
Iberian pigs from farms with strict pedigree control and 
that represent the extant diversity of Iberian varieties. 
The pool included as well the Guadyerbas sow that was 
individually sequenced. 

Results 

Nucleotide variability 

Out of two paired-end (PE) lanes from a reduced repre- 
sentation library in the pool, about 3% of the current pig 
assembly v 10.2 was covered with depth between 3x and 
30x. From one PE and one single end (SE) lane of the 
Guadyerbas sow, ~ 60% of the genome was covered with 
depths 3x - 20 x. Average depths in the pool and in the 
individual were 14x and 7x, respectively. These statistics 
result from filtering reads by a minimum mapping qual- 
ity of 20 with samtools, as suggested to remove ambigu- 
ous mapping (http://samtools.sourceforge.net/). 

In order to better interpret the results of the pool de- 
sign and be able to quantify how much variability is 
likely to be uncovered by sequencing the pool, we ran a 
simulation study mimicking as much as possible the 
pool process and the bioinformatics pipeline we used in 
the analyses of real data (see Materials and methods). 
These simulations suggested that we should detect ~ 
47% of all SNPs actually segregating in the nine individ- 
uals and with a low false SNP discovery rate (0.02) for 
regions covered with a depth of 3-20x. Additional file 1 
shows simulated results by minor allele frequency 
(MAF) and depth. Note that the majority of SNPs that 
are missed is due to their low frequency: while 80% of 
SNPs at MAF < 0.1 are likely undetected, the power for 
SNPs with MAF 0.3 is 60% and approaches 100% at 
higher MAFs. Importantly, the statistics used here to 



infer nucleotide variability were developed to account 
for the bias towards intermediate allele frequency in the 
pooling process (see Materials and methods). 

In all, the raw numbers of SNPs (only segregating 
sites) called using criteria described in methods were 
254,106 in the pool (79.6 Mb covered) and 643,783 in 
the Guadyerbas sow (1.47 Gb covered). The full SNP list 
is available on request from the authors. A total of 17.7 
Mb of the current assembly was covered in both the 
pool and the individual, and 10,324 SNPs were called 
in both designs. The raw number of fixed differences 
between the assembly, primarily a Duroc female, and 
the Iberian pool was 152,225, and 2,503,645 for the 
Guadyerbas. We also detected 49,105 heterozygous 
indels and 316,189 fixed indels in the individual sow. 
We did not call indels in the pool because indel calling 
algorithms are not specific for pools and can be mislead- 
ing. SNP annotation by autosomes, pseudoautosomal re- 
gion (PAR) and non-pseudoautosomal region (NPAR) of 
the X chromosome (SSCX) is detailed in Table 1. SNP 
classes are ranked in decreasing order of severity of their 
predicted functional consequences, according to variant 
effect predictor ensembl pipeline [11]. Note, neverthe- 
less, that these raw numbers are not directly comparable 
between the pool and the individual because of the (un- 
known) different number of individuals actually se- 
quenced in the pool in each region, read depth and 
alignment lengths. 

We computed Wattersons estimates of diversity, 6, 
corrected for pooling and low depth (as detailed in 
methods and in [12]) in non overlapping windows of 
200 kb length. In general, there was a moderate correl- 
ation between pool and individual variabilities (Pearson 
correlation = 0.45, Figure 1) when windows with no SNP 
in the Guadyerbas are removed. Nevertheless, it should 
be reminded that the Guadyerbas strain is highly inbred, 
e.g., we found that ~ 10% of the 200 kb windows were 
devoid of any SNP. Another factor of bias is that, while 
an RRL was sequenced in the pool (3% of the genome), 
the Guadyerbas sow was shotgun sequenced (60% gen- 
ome aligned). Our results suggest a positive correlation 
in nucleotide diversity among nearby genome regions 
for the 17.7 Mb that were covered in both the pool and 
the individual (Figure 1). 

Wattersons 6 are plotted in Figure 2 in 200 kb win- 
dows for both the pool and the individual. In agreement 
with results from [12,13] and [10], variability increased 
towards telomeric regions. This suggests a marked effect 
of recombination in variability. To explore this issue fur- 
ther, we plotted variability vs. recombination rate [14] in 
5Mb, 10Mb and 20Mb window sizes (Figure 3), observ- 
ing a correlation of 0.53, 0.62 and 0.70, respectively. Cor- 
relation increased with window size, probably because 
the genetic maps were obtained from a pedigree with 
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Table 1 Number of SNPs by annotation class and genome region: autosomes, X chromosome non-pseudoautosomal 
region (NPAR) and X pseudoautosomal region (PAR) 



Consequence 


Autosomes 
Guadyerbas 


Autosomes 
Iberian pool 


NPAR 
Guadyerbas 


NPAR Iberian 
pool 


PAR 
Guadyerbas 


PAR Iberian 
pool 


Essential splice site 


30 


30 


1 


1 


0 


0 


Stop gained 


44 


11 


2 


0 


0 


0 


Stop gained,splice site 


1 


0 


0 


0 


0 


0 


Stop lost 


4 


15 


0 


0 


0 


0 


Non synonymous coding 


2650 


1222 


40 


28 


1 


0 


Non synonymous coding,splice site 


51 


31 


0 


1 


0 


0 


Synonymous coding,splice site 


49 


24 


3 


0 


0 


0 


Splice sitejntronic 


282 


254 


10 


8 


0 


1 


5prime utr,splice site 


1 


1 


0 


0 


0 


0 


3prime utr,splice site 


2 


0 


0 


0 


0 


0 


Within non coding gene,splice site 


7 


1 


0 


0 


0 


0 


Synonymous coding 


2676 


1611 


33 


34 


0 


1 


Coding unknown 


8 


4 


0 


0 


0 


0 


Within mature mirna 


1 


1 


0 


2 


0 


0 


5prime utr 


193 


418 


0 


7 


0 


0 


3prime utr 


2103 


1357 


23 


19 


0 


0 


Intronic 


148468 


78279 


1867 


1204 


133 


147 


Within non coding gene 


286 


99 


12 


3 


0 


0 


Within non coding genejntronic 


6 


3 


0 


0 


0 


0 


Upstream 


34314 


15216 


426 


346 


20 


16 


Downstream 


34395 


15737 


628 


346 


24 


14 


Intergenic 


433720 


150572 


7161 


3620 


3087 


1214 


Total 


659291 


264886 


10206 


5619 


3265 


1393 


Annotation terms shown are in decreasing order of severity, as estimated by Ensembl. 



few generations and therefore small genetic distances 
are subject to large sampling errors [14]. We also corre- 
lated variability with other factors that have been 
reported to affect variability, namely GC content and 
gene density [15], and results are in Table 2. Recombin- 
ation rate was still the main factor affecting variability. 
Although GC content was also significant, its condi- 
tional effect was slightly negative, likely because of co- 
linearity. If a model was fitted with only GC content, the 
effect was positive although the model explained much 
lower variability than a model with only recombination 
rate (results not presented). 

In agreement with previous results [10], we observed a 
marked reduced variability in chromosome X NPAR 
relative to the expected value, which is 3/4 that of auto- 
somes; this reduction was more pronounced in the in- 
bred Guadyerbas individual than in the pool (Table 3). 
Note that SSCX is divided in PAR and NPAR regions, 
which exhibit quite distinct patterns of variability. The 
high variability regions in the telomeres correspond to 
the PAR. In fact, variability in the PAR is over 10 times 



higher than in NPAR for the Guadyerbas sow. Although 
the porcine PAR is small (~7 Mb) and diversity estimates 
are subject to larger errors, the difference between PAR 
and NPAR variabilities is dramatic. 

Multicopy regions 

Given the increasing awareness of the importance of 
structural variants in the genome, we also sought to un- 
cover these in the Iberian pigs. In fact, one of the advan- 
tages of resequencing vs. genotyping is that the former 
allows a more precise detection of structural variants in 
the genome than the latter approach, as discovery of var- 
iants with arrays depend on the specific probes used to 
manufacture the chip. Here, we employed an excess of 
read density (depth) method to uncover multicopy 
regions (MCRs, as detailed in methods and in (Paudel 
Y, Madsen O, Megens H-J, Frantz LAF, Bosse M, 
Bastiaansen JWM, Crooijmans RPMA, Groenen MAM: 
Evolutionary dynamics of copy number variation in pig 
genomes in the context of adaptation and domestication, 
submitted). We refer to multicopy regions rather than 
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0.005 



0.010 



0.015 



Guadyerbas variability 

Figure 1 Correlation of Watterson's theta estimates of 
nucleotide variability (per bp) between the individual 
(Guadyerbas) and the Iberian pool across non overlapping 
windows of 200 kb length. Each dot corresponds to theta 
estimates in individual and pool for the same window. 



copy number variants because we analyzed a single indi- 
vidual and we do not have information on whether that 
multicopy region is actually fixed or segregating in the 
population. The draft status of the current porcine gen- 
ome assembly does not allow accurate ascertainment of 
other kinds of variants (e.g., inversions, novel insertions, 
translocations) using aberrant paired-end distance 
methods. MCRs detection is based on read density and 



is therefore less sensitive to mis-assemblies in the 
reference genome. We analyzed only the individual 
Guadyerbas sow because of the uncertainty in the num- 
ber of chromosomes actually sequenced for the pool in 
any given region. Due to limited read depth, we consid- 
ered only gains with respect to reference genome rather 
than gains and losses. 

We found 3,082 outlier regions potentially caused by 
MCRs in the Guadyerbas genome. These were distrib- 
uted among 1,653 windows and spanned 30.5 Mb. The 
majority of the MCR are short (less than 20 kb) and only 
two are longer than 100 kb (Figure 4). These MCRs 
affect 4% of the annotated pig genes in their entirety 
(100% of the gene length) and 2% of the genes partially 
(>50% of their gene length). Barring for errors in the ref- 
erence assembly, therefore, MCRs seem to be an import- 
ant source of variability in the pig, as also observed in 
other species [16]. Distribution of the MCRs along the 
chromosomes is represented in Figure 5. We observed 
a positive correlation between nucleotide variability 
(Watterson's 6) inside the MCRs and the nucleotide 
variability within the 200 kb windows containing MCRs 
but outside MCR boundaries (Pearson correlation = 0.6, 
Additional file 2). Average variability inside MCRs was 
1.51 xlO 3 , somewhat higher than MCR windows but 
outside MCRs boundaries (9.09 xlO 4 ), whereas windows 
devoid of MCRs had the lowest average diversity 
(8.42xl0~ 5 ), suggesting that windows with high nucleo- 
tide variation are enriched in MCRs (Summary statistics 
in Table 4). On the other hand, we detected no correl- 
ation between the number of copies of a MCR and vari- 
ability within MCRs. 
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Figure 2 Watterson's nucleotide variability (per bp) distribution by chromosome (SSC1-SSC18, SSCX) in the pool (top) and the 
individual (bottom). Each dot represents a 200 kb length window, and each chromosome, in a different color. 
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A total of 696 annotated genes fully fell inside MCRs 
and are therefore more likely to be functional than par- 
tially duplicated genes. Our study allowed us to discover 
novel paralogs of annotated genes, originally absent in 
the Duroc reference assembly. These genes primarily 
belonged to well-known multi-genic superfamilies. The 
most over-represented gene family by far was that of the 
olfactory receptors, comprising a total of 476 genes. The 
chromosomes containing the largest number of olfactory 
genes were SSC2 and SSC7 (Figure 6). These results 
agree with data from the international consortium, who 
found that the pig is one of the species with the largest 
repertoire of olfactory receptors, likely a result of the im- 
portance of scent in this foraging species [3]. Similarly, 
large gene families involved in defense and immune re- 
sponse were over-represented within MCRs; we found 8 
new paralogs of annotated interferons (IFN-a8, IFN-alO, 
IFNa-11, IFNal4, IFN82, IFN86, IFNco2 and IFNo4), 2 
interleukines (ILl-ft, IL1B) and five SLA genes (SLA -3, 
SLA-9, SLA-10, SLA-P1, SLA-DRB1). Several tumor ne- 
crosis factor receptors (TNFR) and T-cell receptors (TR) 
were found as well. Other genes within MCRs were in- 
volved in lipid (ACOT4, GPAT2) or carbohydrate metabol- 
ism (5 new paralogs of the UGT2B family and 8 salivary 
and pancreatic amylases), detoxification (CYP2C33 and 
CYP4A21), pheromone binding (PHFROA and PHEROC), 
perception of taste (VN1R2) and fertilization (SPM1). 
Two genes from the serpin-like clade (Serpina 3-1 and 

Table 2 Multiple regression estimates of recombination 
rate, gene and GC contents on Wattersons' variability 
estimates (across 20 Mb windows) in the Guadyerbas 
individual 



Factor 


Estimate 


SD 


P-value 


Recombination rate (cM/Mb) 


4.32x1 0" 4 


4.1 1x1 0" 5 


2.00x1 0" 1 6 


Average gene length (bp) 


-2.1 7x1 0" 9 


3.77x1 0" 9 


0.57 


GC content (%) 


-3.97x1 0~ 3 


1.1 2x1 0~ 3 


0.64x1 0" 3 



Serpina 3-2), retinol dehydrogenase (RDH16), the 
myostatin gene (MSTN) and the lactase gene (LCI) also 
seem to be present in multiple copies in the pig genome. 
Several small RNAs were also detected: two rRNAs (5S 
ribosomal RNA and 5.8S ribosomal RNA), one snoRNA 
(SCARNA6), one snRNA (111) and two miRNAs. A 
complete list of genes entirely inside MCRs is shown in 
Additional file 3. A gene ontology (GO) enrichment ana- 
lysis of biological processes (see Materials and methods) 
found an over-representation of sensory perception of 
smell (adjusted P value = 2.06xl0~ 117 ), response to virus 
(adjusted P value = 2.99xl0~ 06 ) and xenobiotic metabol- 
ism process (adjusted P value = 1.55xl0~ 02 ). 

Outlier regions and potential selection targets 

A matter of intense research is the study of patterns of 
nucleotide variability in domestic species. Outliers in 
these patterns with respect to the standard neutral 
model can be due to selection and then reveal genes of 
socio - economic interest, as well as helping to under- 
stand the effects of domestication and of artificial selec- 
tion in the genome [3]. A serious challenge is that 
selection does not result in a single obvious signal (e.g., 
a selective sweep) but rather in a diversity of manifesta- 
tions that depend on intensity and age of selective 
process as well as on the demographic history of the 
population [15,17]. Here, we employed a number of tests 
that pinpointed a series of genome regions, potentially 
enriched in non-neutral genes. We also took advantage, 
where possible, of the simultaneous availability of pool 
and individual data. Despite the fact the Guadyerbas 
strain only represents one of the Iberian varieties [9], it 
is conjectured that the strongest selective sweeps will be 
shared across all Iberian strains. 

First, we examined extreme windows in terms of low 
and high variability for the Guadyerbas and the pooled 
data (see Materials and methods). A total of 132 genes 
were annotated within the lowest variability windows 
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Table 3 Nucleotide diversity per bp in 


autosomes 


and X 


chromosome 








Guadyerbas 


Iberian 




individual 


pool 


Autosomes 


6.55x1 0" 4 


1.31X10" 3 


Pseudo-autosomal chromosome X (PAR) 


3.02x1 0~ 3 


2.22x1 0" 3 


Nonpseudo-autosomal chromosome X (NPAR) 


1.79X10" 4 


5.83x1 0" 4 



(Additional file 3). A window in SSC1 was specifically 
enriched in interferon genes (IFNE, IFN-alO, IFNcol, 
IFNo3 and IFNg)4), which are involved in response to 
virus (adjusted enrichment GO P=1.3xl0~ 04 ). Note that 
IFN-alO and IFNo)4 are within MCRs, suggesting that 
those genes have un-annotated paralogs and putatively 
under positive selection. Genes within the lowest vari- 
ability windows in NPAR included genes from the Ras 
oncogene family (RAB33A, RAB39B, RAB39B and 
RAP2C), the SOX3 gene, involved in sex determination, 
face development and pituitary gland development, the 
serotonin receptor HTR2C, involved in anxiety, repro- 
ductive and feeding behavior, the MECP2, with a role in 
behavioral fear response, as well as genes involved in 
lipid metabolism (i.e., ACSL4, ALG13, ABCD1, PLP1), in 
hair follicle development (NSDHL) and other genes re- 
lated to immune response (lL13Ral, IL1RAPL2). A 
complete list of these genes is in Additional file 3. 

The majority (-80%) of the high variability windows 
contained MCRs identified in the individual sow as de- 
scribed above. To ensure that the high variability found 
is not influenced by MCR, we removed the SNPs inside 
MCRs. The result was that those windows still con- 
served high variability levels, in agreement with results 




i 1 1 1 1 1 r 

0 20 40 60 80 100 120 

MCR length 

Figure 4 Number of multicopy regions (MCR) by specified 
length (kb). 

V / 



in Table 4. The majority of genes in those windows were 
olfactory receptors, hundreds in total, present in gene 
clusters distributed among almost all chromosomes. In 
addition, other gene families were represented, e.g., ATP- 
binding cassette family, zing finger genes, T-cell receptors 
(TR) and SLA genes (mainly located in chromosome 7), 
transmembrane proteins (J MEM family), several small nu- 
cleolar RNAs, solute carrier family genes, protocadherin 
family genes involved in homophilic cell adhesion and cyto- 
chrome family p450 genes (CYP); see Additional file 3 for a 
complete gene list. Note that IL1B and other gene families 
are present in MCRs and also in high variability regions. 

Next, we computed Tajimas D and Fay-Wu s H statis- 
tics, modified to account for the idiosyncrasy of pool 
data (Materials and methods). In principle, Tajimas D 
and Fay-Wus H negative values can be produced by 
positive selection, although Tajimas D is particularly sen- 
sitive also to demographic effects and prone to false pos- 
itives. The correlation between both statistics was 
moderately positive r = 0.28 (Additional file 4). There 
are also an apparent number of windows with negative 
Tajimas D and zero or even positive Fay-Wus H. Al- 
though the interpretation of this is not clear, it might be 
caused by recurrent hitch hiking events [18]. 

We selected the 1% most extreme windows with com- 
bined negative Tajimas D, Fay-Wu s H and low variabil- 
ity Bw (see Materials and Methods and Additional file 3 
for full results). No over-representation of GO were 
detected after correcting by multiple testing. Interesting 
candidate genes inside those windows are involved in 
axonogenesis and synapsis (FOXP1, LRRK2, EHMT2, 
RAB11A, TEKTS, IGF1R, UNC13C, CNTN1, COL9A2, 
AXIN2, CADPS2, HTR6, KCND1, NOVA1, PTEN), Orca- 
dian rhythm (HEBP1, ALB), epithelial cell differentiation, 
keratinization and hair follicle formation (FOXP1, 
IGF1R, HNF1B, PTEN, AXIN2, KRT81, KRT83, KRT84, 
KRT85, KTR86, PRKD1, AC0210066.1), blood vessel 
morphogenesis (PPAP2B, PRKD1), lipid metabolism and 
fat cell differentiation (PPAP2B, VEPH1, RASA4B, 
ATP10B, NEU1, PTEN, SMPD4, ALB), exploratory behav- 
ior (LRRK2), locomotory behavior (APBA2), grooming 
and feeding behavior (NMUR2), response to starvation 
(GAS6, ALB), spermatogenesis, ovulation and sex deter- 
mination (EHMT2, AFP, IGF1R), visual/odor perception 
(OR5P2, LRRK2, VSX1, GRK1), immune response and in- 
flammatory response (CIITA, PRKD1, FOXP1, IGF1R, 
PTX3). Importantly, the LRRK2 gene is a positive regula- 
tor of the dopamine receptor signalling pathway. The 
complete gene list is in Additional file 3. 

Finally, we performed a genomewide Hudson-Kreitman 
-Aguade (HKA) test in the pool data. The NPAR was ana- 
lyzed separately from autosomes and PAR. After correcting 
for multiple testing, only 25 windows (0.23%) with an 
excess of differentiation were significant (Benjamini- 



Esteve-Codina et al. BMC Genomics 2013, 14:148 
http://www.biomedcentral.com/1471 -21 64/1 4/1 48 



Page 7 of 14 



3 
4 
5 
6 
7 
8 
9 

10 



- . I,, 


... 1 1.. 1 1 . 1 1 1 . 


1 L i.i. 


... 1 J.. , 


,L , .1 ll 


.. ..1. II... . 


1 J.I. 1 


. 1... . 1 


II 1 , 111 .1 ..ll 


J LhU . ., 


. lit. 1 IlJ.L.JI IJiL.L, 


1... Li . ..L .... 1 1 


1. , J 










L.i . 


II i.l J i. l.i ..... i I, 


1. 1 


. 1 1 


i 










. I L I l 


I1.1J..1.M 


.. j.. ■ 


I 










. . 1 III 


Li.i iii ,,,,1,11 .i, ,1 


i J.j. 















J„ 


! 1 , .1,1 , J 1 1 




1 


.1 llll ,1. 


i i 1 il J ii II l ii J ill 


i. , . Jujl 1 ...... 


ii. iJ 


1.1. 


. . . Ii. L 


i. Ji. . i .Ll ..mL j .l. I . J. i 


il ,, 


. l-l- 




. . . Jt . .. jLi i ■ l-l 




1 .. 


J., i.i., 


... .. . . 1 1 


. . M ... , 





11 1- 



12 I mI 1 1 iiii iilti ii I i I 



13 ^ 



15 



I.. I 


J I ... I h ... .1 .. I .... l . ..Jl . , J L . JL L 1 i L ll U, L. . -1, 


. .. .J. .1 U . . 


l.l.li I 


..I.I .1 ll .l. l l... . .1. . .1 il 


■ 


i . .1. 




L I. ill 




.... L 




j. 


.l . .!...[. .ll. .1.1 . ..... l-i . .Ii. * ..II . .L Ji 


I, 


Li J 


l.l -II il- 


i 





16 - 

"| 7 "il J. li ■ I ■ I ■! I. ■ I . I . I b I I 



18 L 



■ II l. M l l l I I 



x ..ii .j,. ■■■■L..L. I. ■LLLiiljj t .i.,.lililiij,.i. ..y. Ji .i . ... 



Figure 5 Chromosome positions of multicopy region gains with respect to the reference genome found in the Iberian genome. Each 
vertical red line corresponds to a multicopy region location and the length of the line is proportional to the number of copies. The shortest 
multicopy region was 4 kb long and the longest, 1 17 kb. 



Hochberg [19] False Discovery Rate, FDR < 0.05). Genes 
within these windows were involved in feeding behavior 
(NPW), social behavior (HTT, DVL1), locomotory behavior 
(HTT, SLCGA3), pigmentation (MC1R), hair follicle mor- 
phogenesis (PDGFA), sensory perception of taste {TAS1R3, 
GNG13), perception of sound (AXIN1), circadian rhythm 
(PRKAA2, ADCY1), tumor necrosis factors (TNFSF12A, 
TNFRSF18, TNFRSF4), male gonad development and sper- 
matogenesis {GFER, BOK), fat cell differentiation (SDF4), 
lipid metabolism (DECR2) and several genes in lipid trans- 
port, e.g., ABCA3, was also reported by the International 
Pig Genome Sequencing Consortium [3] being under selec- 
tion. Interestingly, the neuropeptide AXIN1 has been found 
differentially expressed in brains of two extreme groups of 



junglefow in terms of fearfulness [20]. The complete gene 
list is in Additional file 3. 

Only 39 windows (0.36%) with an excess of poly- 
morphism vs. differentiation were significant (HKA test 
False Discovery Rate < 0.05). Several genes inside those 
windows belonged to gene superfamilies {ABC, OR, 
TRIM, Zinc fingers). Interesting genes to mention are 
involved in immune response, e.g., complement system 
genes (C8A, C8B) and swine major histocompatibility 
complex (SLA-DQA1, SLA-DQB*G01, SLA-DRA1, SLA- 
DRB, SLA-DRB1), feeding behavior and synapsis 
(HCRTR2), visual/sound perception and pigment gran- 
ule transport (MY07A), lipid metabolism (PPAP2A, 
PRKAA2), viral infectious cycle (RPS21) or defense 



Table 4 Nucleotide variabilities (SNPs / bp) within and outside multicopy regions (MCRs) in the Guadyerbas individual 

Region Median Mean SD 

Within MCRs 1.67x10" 4 1.52x10" 3 3.46x1 0" 3 

Outside MCRs, within windows containing MCRs 1 .83x1 0" 4 9.1 0x1 0" 4 1 .82x1 0" 3 

Windows without MCRs 8.43x1 0" 5 3.92x1 0" 4 6.1 1 x1 0" 4 
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Kilobases spent in MCR gains with respect to the reference 
genome normalized by chromosome length 
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Figure 6 Total length of multicopy regions (MCR) per chromosome, divided by chromosome length, for each chromosome. 



response (SPACA3) (full gene list in Additional file 3). 
Within the NPAR region of the X, only one window was 
significant. This window contains the SHROOM2, a gene 
involved in brain, eye and ear morphogenesis and pig- 
ment accumulation among others (Additional file 3). 

Discussion 

This study presents a novel combined analysis of pool 
and individual sequencing. Although pools biases the 
SNP discovery process towards common variants and 
have lower power than individual sequencing [21], our 
simulation indicates that we should expect to detect al- 
most half (47%) of all SNPs. Given that there are Y,i = 
i,i7 Hi - 3.4 times more SNPs in 18 chromosomes than 
in a single individual, the pool process uncovers about 
60% more SNPs than individual sequencing - for any 
given region sequenced in common and assuming an 
average depth of 14 x for the pool and 7x in the individ- 
ual. Note that, nevertheless, the allele frequency 
spectrum is different. In pools, the SNP discovery is 
biased against low MAF SNPs, whereas the probability 
of a SNP being discovered in the individual is independ- 
ent of its frequency in the population, assuming a neu- 
tral site frequency spectrum. The reason for this is that, 
although the probability of being heterozygous f(l-f) is 
maximum at frequency f=0.5 (high MAF), low MAF 
SNPs are much more common than high MAF SNPs 
and both effects cancel each other. 

Genomewide variability in the Guadyerbas sow was 
much lower than that in the Iberian pool; 50% and 70% 
lower for autosomes and NPAR, respectively (Table 3). 
Estimates are corrected for the pooling process so 
the large disparity is not due to SNP calling in pools vs. 
individuals but, rather, to the high inbreeding of 



the Guadyerbas strain. Because the pedigree of the 
Guadyerbas is known since the foundation of the herd 
in 1944 [9], the inbreeding coefficient F for the specific 
sow sequenced was estimated from pedigree as F A = 
0.39 and F x = 0.46 for autosomes and NPAR, respect- 
ively. This results in estimates corrected by inbreeding 
tt a * = 6.55xl0" 4 / (1-0.39) = 1.07xl0" 3 and tt x * = 
1.79x10-4 / (1-0.49) = 3.51 xlO" 4 These values are close 
to those obtained from the pool in autosomes but, intri- 
guingly, for NPAR are still 40% lower in the Guadyerbas 
(Table 3). Therefore, inbreeding explains the loss in vari- 
ability in the whole Iberian pig breed for autosomes but 
not in NPAR. 

Remarkably, autosomal nucleotide diversity in the 
Iberian pool (0.0013, Table 3) are comparable to those 
reported in the two European wild boars sequenced by 
the International Pig Genome Sequencing Consortium: 
Heterozygosity He = 0.0012 and 0.0010 [3]. In contrast, 
heterozygosity in international domestic breeds is higher 
(He = 0.0016 or larger) than in European wild boar or 
Iberian pig, likely because of introgression with Asian 
pigs [22]. The fact that Iberian pigs and European wild 
boar diversities are comparable agrees with previous evi- 
dence showing that Iberian pigs have not been 
intercrossed with Asian germplasm [23]. It also stresses 
the relevance of Iberian pig as a model of native Medi- 
terranean domestic pig that should help to disentangle 
the effects of Asian introgression and domestication on 
response to selection by modern breeding. 

Our data further confirm the much lower ob- 
served than expected variability in SSCX (3/4 that of au- 
tosomes) as was previously reported in the partial 
resequencing of the same Guadyerbas sow [10]. Here, 
because we were able to distinguish between PAR and 
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NPAR regions, the X/A ratio is even lower than reported 
before: 0.27 in Guadyerbas and 0.44 in the pool. In con- 
trast, diversity in the PAR was comparable, or even 
higher, than in autosomes. Although demographic effects 
can reduce X/A variability, the effect observed here is 
quite unusual, and seems to be a pervasive property of 
several porcine populations [12]. Selection can be argued 
as an alternative explanation. Genes within the lowest 
variability NPAR windows included ACSL4, which is a 
candidate loci inside a QTL that affects fatty acid com- 
position in the Iberian pig [24,25], HTR2C, which has 
been identified as a genetic loci potentially contributing 
to maternal infanticide in pigs [26], SOX3, which plays 
an important role in testis development and possibly 
sperm maturation [27], MECP2 regulates fear-dependent 
learning and memory [28], a distinctive biological feature 
between wild animals and its domesticated descendants, 
NSDHL is involved in cholesterol biosynthesis but also 
in hair follicle formation, characteristic that has also 
evolved during domestication process, wild pigs are fur- 
rier than domestic pigs. It should be noted that the black 
varieties of Iberian breed, that include the Guadyerbas, 
are hairless and the red varieties present sparse hair. 

The discovery of thousands of new MCRs (>4 kb) with 
respect to the reference genome potentially indicate 
many copy number variants between the Iberian pig and 
the Duroc reference assembly, although part of those 
could be due to a mis-assembled or incomplete refer- 
ence genome in duplicated regions. In agreement with 
our results, Paudel et al, (op. cit) also report many new 
paralogs of existing genes in a diverse panel of pig 
breeds with respect to the reference Duroc assembly. 
The majority of them overlap with our results, except 
for GPAP2, PHEROA, PHEROC and SPM1, which might 
be Iberian specific MCR or only found here due to lim- 
ited sampling in Paudel et al, (op. cit.). The fact that 
some MCRs have high values of nucleotide diversity 
might be caused by an artifact of the mapping (the Iber- 
ian pig presents more copies than the reference and 
therefore ambiguous reads map to the same locus, caus- 
ing false positive SNPs). Nevertheless, the fact that vari- 
ability in regions outside the MCR with respect to the 
assembly but within windows containing MCRs is higher 
than average genomewide (Table 4) might be an indirect 
consequence of increased recombination, which causes 
MCRs as well as increased variability. All in all, the pre- 
cise interaction between recombination rate, variability 
and multicopy regions is currently conjecture. 

To unravel putative genes under selection in the Iber- 
ian pig lineage we applied different selection tests oper- 
ating at different time scales, primarily we focused on 
regions of very low variability, a combined, Tajima's A 
Fay-Wus H and 9 test and the HKA test. Some of the 
candidate genes found with extreme negative values of 



D-H-6 and low 6 or excess of differentiation in the HKA 
test presented ontologies which have been previously 
reported to be under positive selection. Among those, 
genes related with keratinization and epidermis forma- 
tion {D-H-6 test) have been reported to be under adap- 
tive pressures in human and primates, they act as a 
physical barrier defense vis a vis the external environ- 
ment [29-31]. Several studies in mammals and Drosoph- 
ila have reported immunity related genes (evidence from 
6, D-H-6 and HKA tests) as being under strong positive 
selection against rapidly evolving pathogens [32-37]. We 
also identified several genes involved in feeding behavior, 
fear response and social behaviour {D-H-6 and HKA 
tests). Behavior has been reported as one of the bio- 
logical functions subject to selection during the process 
of pig domestication [12,38,39] and feeding behavior and 
response to starvation are, logically, most relevant traits 
in domestication and in breeding. Pigmentation {MCR1 
gene, HKA test), has been reported to be under positive 
selection in pigs due to human interest to cherry-pick 
different coat colors that would otherwise be quickly 
eliminated in the wild [40]. Spermatogenesis genes {D- 
H-6 and HKA tests) have been reported to be rapidly 
evolving genes in Drosophila and in many other organ- 
isms [41]. Finally, lipid metabolism genes {D-H-6 and 
HKA tests) might also have changed, specifically in the 
Iberian breed, conferring its distinctive lipid composition 
and deposition in the meat. 

Finally, an excess of polymorphism in the HKA test or 
extreme high values of 6 are indicative of balancing selec- 
tion but could also be the result of artifacts due to assem- 
bly errors in duplicated regions. Given the widespread 
occurrence of MCRs reported here, further investigations 
in this direction are needed. In particular, improving the 
reference pig assembly, will help to disentangle both ef- 
fects on polymorphism data. As reported in previous stud- 
ies [42-44], genes involved in the perception of smell 
(olfactory receptor family) and genes involved in antigen 
presentation and defense response (e.g., SLA) are inside 
these regions. 

Conclusion 

The recent completion of the porcine sequencing project 
has allowed digging deeper into the complexities of the 
Iberian pig genomes than was possible until now. This 
breed is important because it represents a primigenious 
European breed that, while being domestic, has not been 
introgressed with Asian germplasm. Our data confirm 
the importance of structural variation in the porcine 
species, as also observed in other species. The tests ap- 
plied suggest that many and diverse selective processes 
have occurred in the Iberian pig lineage, among 
them changes in feeding behavior. New bioinformatics 
tools, e.g. that deal with structural variants, as well as 
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projects aimed at complete annotation of the pig gen- 
ome (ENCODE) are needed to improve interpretation of 
the results. 

Material and methods 

Samples and sequencing 

The genome of a highly inbred Iberian pig, pertaining to 
the Guadyerbas strain, which has been partially se- 
quenced (1% of the genome) in a previous study [10], 
was shotgun sequenced using Illumina Hiseq2000. We 
run one 100 bp PE lane and one 100 bp SE lane. In 
addition, we also sequenced a reduced representation li- 
brary (RRL) of a pool comprising nine sows (equal con- 
centrations of each) from the most representative 
Iberian varieties in Spain: Retinto, Mamellado, Torbiscal, 
Guadyerbas, Entrepelado and Lampino. All sequenced 
sows are registered in the Iberian Herd Book and were 
sampled from well accredited farms that have kept pure- 
bred Iberian pigs without intercrossing with 'foreign' 
breeds. The method to construct the reduced represen- 
tation library is described elsewhere [10]. For the pool, 
Illumina GAIIx technology of 50 bp was employed, and 
2 PE lanes were available. As outgroup, we shotgun se- 
quenced a Potamocherus porcus male using Hiseq2000 
(three PE lanes, 100 bp long) in order to measure diver- 
gence and asses ancestral alleles so that we can apply more 
powerful tests to detect selection (HKA, Fay-Wus H). 

We were able to delineate the boundaries between 
PAR and NPAR because of read depth differences in 
males along SSCX (unpublished data). The SSCX PAR 
occupies the first 6.7 Mb and the last 400 kb of SSCX, 
approximately. Although assembly 10.2 separates two 
telomeric PARs, linkage analyses using genotyping data 
from the 60k SNP chip in an Iberian x Landrace cross 
and results from Burgos-Paz [45] suggest that a single 
PAR exists - as in most mammals. We therefore pooled 
the results from the two annotated PARs in the analyses 
reported here. 

Alignment and SNP calling 

Reads were mapped against the latest reference genome 
(assembly 10.2) using BWA [46], allowing 7 mismatches 
and filtering by mapping quality of 20. P. porcus reads 
were aligned disregarding the paired end structure, i.e., 
they were aligned as SE. This was done to minimize the 
possibility that structural changes between the two spe- 
cies prevent alignment. A total of 345M reads were 
aligned, resulting in an average depth of 20x (3-50x) and 
1.6 GB of the S. scrofa genome assembled. 

SNP calling for the Guadyerbas individual was 
performed using samtools mpileup option [47] filtering 
by minimum depth of 3x, maximum depth of 20 x and 
SNP quality of 20. SNP calling in the Iberian pool was 
done using SNAPE (http://code.google.com/pZsnape- 



pooled/) [48], setting divergence to 0.01, prior nucleotide 
diversity 0.001, folded spectrum and filtering by a pos- 
terior probability of segregation > 0.90. The SNAPE ap- 
proach consists in computing the posterior probability 
of SNP frequency being distinct from 0 or 1, given that 
we observed n A alternative alleles and C-n A reference 
alleles, and given prior frequency in the population be- 
ing P(/): 

P(f/n A )ocP(n A /f) P(f) 

Where 

with p being the probability that an allele A is read and 
n, the number of chromosomes in the pool. This prob- 
ability in turn depends on n, k and on whether there is a 
true A in the genome or whether it is the result of a se- 
quencing error. The algorithm considers the geometric 
mean of sequence qualities for every allele read to com- 
pute this probability [48]. In the equation above, we take 
into account the probability that k counts of the allele 
are present in the pool, given that its true frequency is / 
and that, given /<, how many reads A out of n are 
expected. Because some quantities, notably /<, is un- 
known, this is integrated out. For prior p(/), we consid- 
ered the standard neutral model expected frequency, i.e., 
/al// 

Simulation of pooling process 

Although pools are a highly cost-effective strategy, the 
variability uncovered is only a fraction of the true vari- 
ability in the population. We sought to evaluate the 
power and false discovery rate of our experimental de- 
sign by simulation. We simulated 18 chromosomes of 1 
Mb using coalescence with ms program [49] under a 
standard neutral model with nucleotide diversity tt and 
scaled recombination rate p per site = 0.001. For each 
resulting chromosome, the program ART [50] was used 
to generate reads with the built-in profile for Illumina 
paired-end technology of 75 bp-long reads. To simulate 
the pooling process, reads were randomly selected from 
each sequence using an equal proportion from each in- 
dividual. An average depth of 14x was simulated for the 
whole pool in all and reads were aligned with BWA [46]. 
Next, SNPs were called with SNAPE, restricting mini- 
mum and maximum depths to do the calling between 
3x and 30 x as in our real data analyses. Power was com- 
puted as the proportion of true SNPs in the population 
(i.e., before pooling) located within regions of appropri- 
ate depth that were correctly recovered. False Discovery 
Rate (FDR) was the proportion of SNP calls that were 
incorrect. A total of 100 replicates were simulated. 
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Multicopy region detection 

Read depth method [51,52] was applied to identify copy 
number of a region. Basically, we employed the same 
pipeline as in Paudel et al (op. cit, submitted). First, we 
employed mrsFAST [53], an exhaustive mapping tool 
that allows paralog detection, to align reads (allowing 6 
mismatches) against the repeat masked reference gen- 
ome; repeat mask information was obtained from NCBI. 
Average read depth for each non-overlapping lkb bin 
was calculated across the genome and copy number 
(CN) of each unit was predicted based on the average 
read depth across the diploid region. 1:1 orthologous 
genes between human, cow and pig was used to obtain 
read depth across diploid region. Since these regions 
have the same number of copies in 3 relatively distant 
species, we assumed these were conserved in a copy 
number neutral stage. Finally, chained regions in the ge- 
nomes which are > 4kb in length having copy number 
>3 (each bin should have CN > 3 and 1 kb gap was 
allowed), were extracted and declared MCRs. Next gen- 
eration sequencing methods introduce bias in the read 
depth, which is caused by the dissimilar GC content of 
different segments of DNA. To correct this bias, we used 
GC intervals and the average read depth across the dip- 
loid region to find out the correction factor and used 
that factor to correct depth of each 1 kb bins [52]. 

Nucleotide variability estimation and selection tests 

Note that, with next generation sequencing data at low 
depth, nucleotide diversity cannot be simply computed 
dividing the number of SNPs called by the length of se- 
quence assembled. This is because, with shallow depth, 
the two alleles of the same SNP may not be read and be- 
cause of errors in calling SNPs. For the individual, we 
corrected for low coverage as detailed in [10]: 



where S is the raw number of SNPs, L(i) is the length in 
bp of depth i for that window, and P(S\i) is the probabil- 
ity of reading both alleles for depth i [10]. In the case of 
pools, Wattersons theta was computed as in [12]. 
Briefly, we correct by the expected number of chromo- 
somes sampled for each read depth along the window: 

6 W = —, — : (1) 

where L(i) is the length in bp of depth i for that win- 
dow, and Pc( ; | nr(i), nc) is the probability that a set of 
nr sequences randomly extracted from nc possible 
chromosomes contains sequences coming from 



precisely / different chromosomes. Finally, aj is Ewens 
constant i = i,« - 

Definition of low and high variability windows 

Given that over 10% of Guadyerbas windows had no 
SNP, we defined extreme low variability regions for the 
Guadyerbas as those windows devoid of variability and 
with > lOkb assembled. Among these windows, we se- 
lected those of 5% lowest variability in the pool as well, 
with a minimum of 3 kb aligned. In that way, we avoid 
choosing fixed regions in the Guadyerbas strain due to 
drift. We defined extreme high variability regions as the 
5% most variable windows in Guadyerbas and in the 
pool where at least 10 kb (Guadyerbas) and 3 kb (pool) 
were aligned. 

Tajima's D and Fay-Wu's H tests 

Tajimas D test [54] were computed as the normalized 
difference between the average pairwise nucleotide dif- 
ference 6tt and the Watterson estimator, divided by the 
theoretical variance of the same difference in the stand- 
ard neutral model without recombination in pools 
(Ferretti L, Ramos-Onsins SE, Perez-Enciso M: Popula- 
tion genomics from next generation sequencing of 
pooled lineages, submitted). The estimator of 6 based on 
tt was computed as the average pairwise nucleotide di- 
versity across all reads for a given position, averaged 
over all positions and corrected by a multiplicative factor 
2n/(2n-l) [55]. This estimator is unbiased under the 
neutral model. The normalized Fay and Wus H test [56] 
was computed similarly from the standardized difference 
between 6tt and the estimator 6 H based on high fre- 
quency derived alleles. For the estimator 6 H > only sites 
with known outgroup bases were used, and the estima- 
tor was obtained by summing all segregating sites with k 
derived alleles in r reads weighted by the factor k 2 /r(r-l) 
and divided by a factor correcting for the bias (Ferretti 
et al, op cit.). The variances in the denominators are 
evaluated exactly in the limit of short read for the stand- 
ard neutral model without recombination following the 
results of [57] and accounting for the random extraction 
of reads from individuals. Code is available from L. 
Ferretti (luca.ferretti@gmail.com). 

In order to minimize confounding demographic effects 
with selection fingerprints, we calculated the empirical 
joint distribution combining Tajima's D, Fay and Wu's H 
and Wattersons 6 as in [58]. To do so, we sorted the nor- 
malized statistics D, H and 6, the empirical test was 
obtained simply by multiplying the inverse of the ranks 1/ 
M, 2/M, ... 1 of each statistic for each window 1 . . . M, 
and normalizing. A GO enrichment analysis was 
performed with genes within the 1% most extreme 
windows. 
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Additional file 2: Variability (Wattersons's estimate, per bp) inside 
multicopy regions vs. variability of windows containing multicopy 
regions but outside the multicopy region units? 

Additional file 3: Genes within multicopy regions and extreme 
selection tests' windows. MCR genes: genes within multicopy regions; 
Lowest theta shared autosomes: genes within extreme low 0 in 
autosomes and X pseudoautosomal region (PAR) common in the 
individual and the pool; Lowest theta shared non-pseudoautosomal 
region (NPAR): genes within extreme low 0 in X NPAR region common in 
the individual and the pool; Largest theta pool: genes within extreme 
high 0 regions in the pool; Largest theta individual: genes within extreme 
high 0 regions in the individual; Lowest combined test: genes with 
lowest values of the combined Tajima's D- Fay&Wu's H and 0 test; HKA 
excess of differentiation autosomes+PAR: genes within HKA excess of 
differentiation in autosomes and X PAR region; HKA vexcess of 
polymorphism autosomes+PAR: genes within HKA excess of 
polymorphism in autosomes and X PAR region; HKA excess of 
polymorphism NPAR: genes within HKA excess of polymorphism in X 
NPAR region. 

Additional file 4: Correlation across 200 kb windows between 
Tajima's D and Fay - Wu's H statistics in pooled data. Regression 
line is shown in red. 



Hudson-Kreitman-Aguade test 

Multilocus Hudson-Kreitman-Aguade (HKA) tests were 
calculated in the pool using the P. porcus alignment as 
outgroup and following the original algorithm [59]. We 
applied the test dividing the genome in 200 kb windows. 
Then, M+l equations were solved using a bisection algo- 
rithm to calculate the estimates of the M+l parameters 
(M theta values, one per window, plus the time of split 
between species measured in 2Ne generations). Thus, a 
partial HKA test per window was obtained plus the total 
sum of values, where the null hypothesis (stationary neu- 
tral model) is contrasted using M-l d.f. The approach as- 
sumes unlinked windows and it is, therefore, conservative 
because nearby windows are linked. The original HKA for- 
mulae require a n = ,- = i^_il/i and b n = i = i^-il/^ 
constants, which in the case of pooling are unknown. In- 
stead we used the equivalent correction to infer Watterson s 
theta from pools (denominator in eq. 1), whereas bn was 
obtained by interpolation from a n . The HKA function can 
be downloaded from http://bioinformatics.cragenomica.es/ 
numgenomics/people/sebas. In order to identify outlier 
windows we performed a Benjamini-Hochberg [19] mul- 
tiple test correction over the value of the partial Chi-square 
per window using a 5% false discovery rate. 

Annotation and Gene Ontology enrichment analysis 

SNP annotation was performed using the Variant Effect 
Predictor perl script from Ensembl [11] and the Sus scrofa 
gtf annotation file was from Ensembl release 67, the latest 
version and that used in the pig genome publication. 
Gene ontology enrichment analysis was performed 
using FatiGO, a module of Babelomics using the human 
genome as background and converting Ensembl pig IDs 
to Ensembl human IDs. 

Data accessibility 

Aligned reads in bam format are accessible at sequence 
read archive (SRA), http://www.ncbi.nlm.nih.gov/sra (ex- 
periment ID: SRX245748). 

Ethics statement 

Animal manipulations were performed according to the 
Spanish Policy for Animal Protection RD1201/05, which 
meets the European Union Directive 86/609 about the 
protection of animals used in experimentation. 

Additional files 



Additional file 1: Top: Simulated power against depth. Power was 
computed as the number of SNP called by SNAPE software divided by 
the total number of real SNPs in the pool. Depth corresponds to the 
average depth in the pooled data. Bottom: Power against MAF (minor 
allele frequency in the pool). 
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